Crucial to all good statistical analyses is a thorough exploratory analysis. In this document, we will introduce a few techniques for developing an understanding of our dataset, including an understanding of its limitations. We will also present many of the data preparation steps needed to be able to fit the models in the subsequent tutorials.
Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.
library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)
If you have not done so already, you need to download the files associated with the workshop.
All the workshop script and tutorial files can be found on Github at github.com/joenomiddlename/DFO_SDM_Workshop_2020. To download these files on your computer, press on Code and then Download ZIP. Once downloaded, unzip the DFO_SDM_Workshop_2020-main.zip file. To be able to access some of the precompiled data, we need to make the unzipped folder the working directory. To do so in R Studio, navigate to the correct folder using the bottom right panel (folder view, you can use … button) and open it. Then, click “Set as Working Directory” under the tab ‘More’.
We should be inside the folder titled: ‘DFO_SDM_Workshop_2020’, you can verify this by using getwd().
Now we can load in the precompiled data
list2env(readRDS('./Data/Compiled_Data_new.rds'), globalenv())
## <environment: R_GlobalEnv>
Throughout this workshop, we use data collected on fin whales in June, July, and August across the years 2007-2009 and for 2011. There are two main types of data:
Encounters from systematic aerial surveys, found in Sightings_survey. The encounters information contain counts of fin whales, although for this tutorial we simplify the analysis and only model presence/absence information. These systematic surveys also contain information on the distance of the encounters from the plane. Furthermore, they are accompanied by the aerial tracklines, found in Effort_survey. In these data objects, we combined three surveys conducted by both DFO and NOAA: 2007 T-NASS Aerial Survey (Lawson et al. 2009), NOAA NARW Surveys (Cole, NOAA), and NOAA Cetacean Surveys (Cole et al., NOAA).
Opportunistic fin whale encounters from two whale-watching operators, found in Sightings_DRWW_sp. This dataset is not accompanied by tracklines. Again, these datasets contain whale counts, but for simplicity we will analyze them as presence/absence data in these tutorials. These encounters are maintained in the DFO Maritimes Region Whale Sightings Database (MacDonald et. al. 2017).
Distance from port for both the Quoddy and Brier whale-watch vessels, found in Dist_Brier and Dist_Quoddy. These covariates will be used to model effort from the whale-watch vessels. These distances from port objects are accompanied by the locations of the ports, found in WW_ports.
Bathymetric slope data, found in Slope. Slope data will be our main covariate in our analysis and we will explore whether the distribution of fin whales is linked to this.
Study area, found in Domain. A polygon that describes the limits of the study area.
These data are spatially-referenced data. First, we must always check for consistency between the coordinate reference systems (CRS) of each spatial object in use! To print the CRS of an sp object, simply add @proj4string to the name of the spatial object and run.
Sightings_DRWW_sp@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Sightings_survey@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Effort_survey@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Domain@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Slope@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
WW_ports@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Dist_Brier@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Dist_Quoddy@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
All of the spatial objects are in lat/lon - good! For future analysis we will be projecting the data into a different coordinate reference system to better preserve euclidean distance.
Finally, let’s turn off all warnings associated with coordinate reference systems. The latest PROJ6+/GDAL3+ updates have caused many warning messages to be printed.
rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")
Our first goal is generally to plot the data on a map. The gg() and gmap() functions from the inlabru package are extremely useful at plotting spatial data! The class of spatial objects we will use are from the sp package. The classes of these objects begin with ‘Spatial’. For example SpatialPointsDataFrame.
Let’s plot our data!
If the data are in lat/lon format then the gmap() function will automatically add a terrain layer to the plots. Let’s plot the survey encounters in blue, survey tracklines in black, whale-watch encounters in purple, and whale-watch ports in red.
gmap(Sightings_survey) +
gg(Domain) +
gg(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
Some of the maps used by default in gmap are copyrighted (e.g., Google maps, see ?get_map), see Additional tips for ways to use open-source maps for publication.
To remove the map layer, simply replace the gmap(Sightings_survey) with ggplot().
ggplot() + # Notice the empty ggplot() call
gg(Domain) +
gg(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
This plot hides some crucial information regarding the data collection. For example, the survey encounters and tracklines do not come from a single survey, or even a single organization!
table(Effort_survey$DATASET)
##
## DFO NOAA_1 NOAA_2
## 49 54 70
Let’s plot the effort from each survey. The easiest way to do this is to subset the data accordingly!
ggplot() +
gg(Domain) +
gg(Effort_survey[Effort_survey$DATASET=='DFO',], colour='purple') +
gg(Effort_survey[Effort_survey$DATASET=='NOAA_1',], colour='red') +
gg(Effort_survey[Effort_survey$DATASET=='NOAA_2',], colour='yellow')
We see that the DFO tracklines (in purple) do not overlap with the two NOAA surveys! Such spatial segregation is problematic and any future model will be unable to identify differences in protocol efficiency without strict assumptions. Why? Suppose the whale intensity is estimated to be higher in the DFO survey. Here, it would be challenging to determine whether or not the cause was in fact due to a higher whale intensity in the DFO-covered waters, or simply due to differences in survey protocol! More on this later!
In addition, the surveys were conducted across 4 separate years. Try plotting the survey tracklines by year on your own.
Hint: the names of the variables in the Effort_Survey are: DATASET and YEAR. The YEAR variable is of type character and contains 4 unique values (see below).
table(Effort_survey$YEAR)
##
## 2007 2008 2009 2011
## 101 20 19 33
class(Effort_survey$YEAR)
## [1] "character"
If you get stuck, click ‘Show Code’.
ggplot() +
gg(Domain) +
gg(Effort_survey[Effort_survey$YEAR=='2007',], colour='purple') +
gg(Effort_survey[Effort_survey$YEAR=='2008',], colour='red') +
gg(Effort_survey[Effort_survey$YEAR=='2009',], colour='blue') +
gg(Effort_survey[Effort_survey$YEAR=='2011',], colour='yellow')
Do you see any cause for concern?
The surveys from years 2007, 2008 and 2009 covered largely different regions! Again, this could be problematic if we wanted to model any changes in the whale distribution over time! As with the previous plot, estimating differences in the whale intensity over the years will be challenging due to the observers visiting different regions each year. For identifying a year effect, it helps to have significant spatial overlap in the efforts across the years.
That being said, the data from 2011 appear to be a good candidate for model comparison as the spatial range overlaps with the other 3 years’ effort. We will holdout this data as our test data and use 2007, 2008, and 2009 as our training data.
xtabs(~ YEAR + DATASET, data=Effort_survey@data)
## DATASET
## YEAR DFO NOAA_1 NOAA_2
## 2007 49 3 49
## 2008 0 20 0
## 2009 0 19 0
## 2011 0 12 21
2011’s data comes exclusively from NOAA.
For modelling, we will transform the data from lat/lon into a new “Canadian” CRS: NAD83 datum with UTM Zone 20N projection. This projection will help to preserve euclidean distance between points. We define the CRS object for this projection with EPSG code 2961.
Can_proj <- CRS("+init=EPSG:2961")
Can_proj <- fm_crs_set_lengthunit(Can_proj, unit='km')
The second line of code specifies that we want to work in units of km instead of the default meters. This can prove vital in applications to avoid numerical overflow.
To do the transformation, we will use the spTransform() function. For example, we transform the whale-watch encounters spatial object Sightings_DRWW_sp as follow:
Sightings_DRWW_sp <- spTransform(Sightings_DRWW_sp, Can_proj)
Sightings_DRWW_sp@proj4string
## CRS arguments:
## +proj=tmerc +lat_0=0 +lon_0=-63 +k=0.9996 +x_0=500000 +y_0=0
## +ellps=GRS80 +units=km +no_defs
Notice the changed output when calling @proj4string.
Please repeat this for all the spatial objects that are points, lines or polygons.
Sightings_survey <- spTransform(Sightings_survey, Can_proj)
Effort_survey <- spTransform(Effort_survey, Can_proj)
WW_ports <- spTransform(WW_ports, Can_proj)
Domain <- spTransform(Domain, Can_proj)
Transforming the raster-like SpatialPixelsDataFrame objects (Slope, Dist_Brier, and Dist_Quoddy) using spTransform would be inappropriate because the projection leads to a curvature of the pixels. A more appropriate approach is to use bilinear interpolation through the projectRaster() function from the raster package. This requires converting the SpatialPixelsDataFrame object into an object of type raster using the function raster(). Finally, to convert the raster object back into a SpatialPixelsDataFrame, we can use the as() function from the maptools package. We use the as() function substantially throughout the workshop for converting spatial objects between the popular packages (e.g., sp, raster, sf).
Slope <- as(projectRaster(raster(Slope), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Brier <- as(projectRaster(raster(Dist_Brier), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Quoddy <- as(projectRaster(raster(Dist_Quoddy), crs=Can_proj), 'SpatialPixelsDataFrame')
We will plot the (transformed) slope and distance from port spatial objects. We are going to combine these into a single plot using the multiplot() function from the inlabru package. This function takes as input ggplot objects and an argument layout, specifying how the plots should be arranged (see Additional tips for ways to change the layout).
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)')+
coord_fixed(ratio = 1),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))
Don’t like the colour scheme? See Additional tips to learn how to define your own manually.
The Domain in its current form has a very complex and ‘wiggly’ coastline. The modeling functions from inlabru will require a triangulation mesh that covers the Domain. Unfortunately, accurately capturing the ‘wigglyness’ of the coastline with a mesh requires it to have a large number of triangles, slowing down the computation. To help reduce the number of triangles required to recreate the coastline, we will smooth the Domain using the functions gSimplify(). gSimplify() attempts to reduce the number of segments used to define the SpatialPolygonsDataFrame. The amount of reduction is determined by the argument tol=.
While we want to simplify the coastline, we still need to accurately capture the shape of the coastline. Failing to do so can lead to some of the encounters to fall outside the mesh, forcing us to discard them. To stop some of the encounters from falling outside of the mesh, we will use gBuffer() in combination to gSimplify(). gBuffer() will extend (or buffer) the boundary of the SpatialPolygonsDataFrame by an amount determined by the argument width=.
Domain_restricted <- gSimplify(gBuffer(Domain,width=15),tol=20)
# Plot
ggplot() +
gg(Domain_restricted) +
gg(Domain, color='red') +
gg(Effort_survey, color='blue')
# Does the new domain contain all the survey tracklines?
gContains(Domain_restricted,Effort_survey)
## [1] TRUE
inlabru requires that every covariate is defined at every point within the computational mesh. When the computational mesh is created later, the boundary may expand slightly. If the covariates are defined on the original domain, there may end up being points of the mesh that do not have a well-defined covariate value. By creating a second extended domain that is slightly larger than the first extended domain created above, we can ensure that a covariate value is properly defined at every value in the computational mesh. We create this extended domain:
# Create a new domain that is slightly larger - for defining covariates later
# Notice width of gBuffer is larger
Domain_restricted2 <- gSimplify(gBuffer(Domain,width=35),tol=20)
# Plot
ggplot() +
gg(Domain_restricted) +
gg(Domain_restricted2, color='green') +
gg(Domain, color='red')
# Does it contain the old
gContains(Domain_restricted2,Domain_restricted)
## [1] TRUE
Great! We have defined two new buffered and smoothed domains. The original domain (Domain) is shown in red and is where the computation will take place. The domain over which the computational mesh will be defined (Domain_restricted) is shown in black. Lastly, the domain over which the covariates will be defined (Domain_restricted2) is shown in green.
The smoothing of the coastline helps us to define a computational mesh that has fewer numerical issues. All computations will still take place over the original domain (Domain). More on that later!
Now we must extend our covariates to take ‘sensible’ values at all points in Domain_restricted2. These imputed values will not affect the model estimates as the model will be fit to the original Domain.
Our covariate Slope is well-defined on the original (un-buffered) domain as defined by the SpatialPolygons Domain. At values outside this, we choose to ‘fill-in’ these points with nearest-neighbour imputations. Next, we redefine the covariate as a log Slope variable, which we call log_Slope. We take the logarithm to reduce the range of values the covariate can take as these could cause our models to make wild predictions in the future tutorials. We discuss the reasons for taking a log transform in more depth later in this session.
## define log_slope covariate
log_Slope <- Slope
log_Slope$FIWH_MAR_Slope <- log(log_Slope$FIWH_MAR_Slope+1e-2)
names(log_Slope) <- 'log_Slope'
# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]
# 2) Extract values of the covariate at the new pixel locations
pixels_Domain$log_Slope <- over(pixels_Domain,log_Slope)$log_Slope
# 3) impute missing values with the nearest neighbour value
pixels_Domain$log_Slope[is.na(pixels_Domain$log_Slope)] <-
log_Slope$log_Slope[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$log_Slope)),]),
'ppp'),as(SpatialPoints(log_Slope@coords),'ppp'), what = 'which')]
log_Slope <- pixels_Domain
# 4) Create a squared log depth covariate (for later)
log_Slope_sq <- log_Slope
names(log_Slope_sq) <- 'log_Slope_sq'
log_Slope_sq$log_Slope_sq <- log_Slope_sq$log_Slope_sq^2
# Plot the covariates with Domain_restricted overlayed
ggplot() + gg(log_Slope) + gg(Domain_restricted)
We also need to buffer and rescale the distance from port covariates Dist_Brier and Dist_Quoddy. Unlike with previous covariates, however, we will fix all buffered values on land equal to a large constant. We will see later that the analysis will not be impacted by this choice of constant.
We show how to do it with the distance to Brier.
# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]
# Extract the average value of distance at the newly created pixel locations
pixels_Domain$Dist_Brier <- over(pixels_Domain,Dist_Brier)$Dist_Brier
# There are some missing values due to newly created pixels being on land
# Fill in missing values with a very large value (1000km).
pixels_Domain$Dist_Brier[is.na(pixels_Domain$Dist_Brier)] <- 1e3
Dist_Brier <- pixels_Domain
# There is an infinity value at the port. Change to 0
Dist_Brier$Dist_Brier[is.infinite(Dist_Brier$Dist_Brier)] <- 0
max(Dist_Brier$Dist_Brier)
## [1] 1000
# Let's scale the Dist covariates closer to the (0,1) scale
Dist_Brier$Dist_Brier <- Dist_Brier$Dist_Brier / 980.7996
ggplot() + gg(Dist_Brier) + gg(Domain)
Now, please rescale and buffer the distance to Quoddy object.
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]
pixels_Domain$Dist_Quoddy <- over(pixels_Domain,Dist_Quoddy)$Dist_Quoddy
pixels_Domain$Dist_Quoddy[is.na(pixels_Domain$Dist_Quoddy)] <- 1e3
Dist_Quoddy <- pixels_Domain
# There is an infinity value at the port. Change to 0
Dist_Quoddy$Dist_Quoddy[is.infinite(Dist_Quoddy$Dist_Quoddy)] <- 0
# Let's scale the Dist covariates closer to the (0,1) scale
Dist_Quoddy$Dist_Quoddy <- Dist_Quoddy$Dist_Quoddy / 980.7996
ggplot() + gg(Dist_Quoddy) + gg(Domain)
In the first lecture, we defined the species’ intensity surface \(\lambda_{true}(s)\). We interpret \(\lambda_{true}(s)\) as an ‘expected encounter rate per unit effort around s’. Consequently, \(\lambda_{true}(s)\) is proportional to the true species space use as it is proportional to the species’ utilization distribution \(\pi(s)\). Loosely speaking, \(\lambda_{true}(s)\) helps to answer the question ‘where are the species?’ and is the target quantity we wish to estimate!
However, we believe that the effort from our observers is heterogeneous across our domain \(\Omega\). This heterogeneous effort makes the estimation of \(\lambda_{true}(s)\) a challenge, since both the effort and the true species’ space use both drive the observed encounter locations. To control for effort, we attempt to estimate/approximate an effort intensity surface \(\lambda_{eff}(s)\). This defines ‘the expected amount of observer effort around s’. Loosely speaking, \(\lambda_{eff}(s)\) answers the question ‘where do the observers look?’. Both of these surfaces will be linked to a set of covariates (and later we’ll even allow for random fields and/or random effects to be included).
For well-designed surveys, e.g. distance-sampling surveys, we have additional data available on the perpendicular distance of the encountered animal to the observer. We can use this to jointly model a detection probability function \(p(d)\). \(p(d)\) models the ‘probability that an individual of the species is detected by an observer, given it is a distance \(d\) away’. inlabru, the package we will use in the later workshops, handles all of the necessary integrations required to jointly model the encounter locations with the encounter distances.
Finally, we defined the observed intensity of the encounters \(\lambda_{obs}(s)\) as ‘the expected density of encounters per unit area around point s’. This quantity is not the target of our inference as it is confounded by effort. Loosely speaking, \(\lambda_{obs}(s)\) helps to answer the question ‘where do we expect encounters between the species and observers to occur?’. If our approximation of the effort, \(\lambda_{eff}(s)\), and the fitted detection probability function \(p(d)\) are good, then we hope to see:
\[\lambda_{obs}(s) \approx \lambda_{true}(s) \lambda_{eff}(s)\]
or, for a single survey trackline segment: \[\lambda_{obs}(s) \approx \lambda_{true}(s) \lambda_{eff}(s) p(d(s)).\]
The above product decomposition reads loosely as “the expected number encounters around a point s equals the amount of time the observers spent looking around point s multiplied by the expected rate the species visits location s”. Note that \(d(s)\) above defines the perpendicular distance of the point \(s \in \Omega\) from a chosen trackline segment. Since \(d(s)\) is defined uniquely for each trackline segment, the integration must take place over each trackline segment separately. This is especially important when tracklines overlap! Later in workshops 2 and 3, we will see that inlabru will do this for us.
Later, we will be fitting spatial point process models to the encounters. A requirement for using these models is that, conditional on \(\lambda_{eff}(s)\), each encounter with the animal is an independent snapshot from its true intensity \(\lambda_{true}(\textbf{s})\).
Planned surveys can be designed to (approximately) satisfy the independence assumption. The speed of an aircraft or vessel can be fixed at a high value relative to the individuals’ speeds (no double-counting) and a narrow perpendicular field-of-view can be enforced. Then, assuming that the individuals of the target species are in equilibrium with respect to their stationary distribution \(\lambda_{true}(\textbf{s})\) (i.e. the species’ density) and locally mixing faster than the time between revisits by observers, the independence assumption may reasonably hold.
Conversely, no such assumptions can be made regarding the whale-watch encounters. The unknown positions of the whale-watching boats when they are not with whales, the 360 degree fields-of-view on-board, the variable boat speed, and the repeated encounters of the same individuals throughout each day can all invalidate the independence assumption required to fit the desired models.
In our workshop, we will see some potential solutions to address this issue of independence. In this session, we will perform some exploratory analyses to investigate the suitability of the independence assumption, and to investigate possible ways of estimating observer effort.
Let’s look at the whale-watching data. Specifically, let’s investigate the 2nd, 3rd and 4th encounters in the dataset.
Sightings_DRWW_sp@data[2:4,c('WS_TIME','WS_DATE','PLATFORM')]
## # A tibble: 3 x 3
## WS_TIME WS_DATE PLATFORM
## <Period> <dttm> <chr>
## 1 16H 5M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
## 2 16H 23M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
## 3 16H 32M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
Notice that they were made on the same day and very close together in time. Often, little information is provided on the database management procedures. For example, are these multiple encounters with the same individual? Would the database discard or retain problematic repeated encounters? Without additional information available on the database management protocols, we can investigate whether these encounters could be of the same individual. To investigate this hypothesis, we can compute the distance between the encounters using the gDistance() function, compute the time between the encounters, and then use these values to estimate the required travel speed of the individual.
# How far away are these encounters in space (in kilometers)?
gDistance(Sightings_DRWW_sp[2:4,], byid = T)
## 1 2 3
## 1 0.000000 1.316181 3.637082
## 2 1.316181 0.000000 2.386592
## 3 3.637082 2.386592 0.000000
# How far away are these encounters in time (in minutes)?
Sightings_DRWW_sp[3:4,]$WS_TIME-Sightings_DRWW_sp[2:3,]$WS_TIME
## [1] "18M 0S" "9M 0S"
# Could this be the same animal? How fast is this implied movement?
gDistance(Sightings_DRWW_sp[2:4,], byid = T)[cbind(c(1,2),c(2:3))] /
(c(18, 9)/60)
## [1] 4.38727 15.91061
This hypothesis would imply a travelling speed of between 4km/h and 16km/h. Is this plausible? Since fin whales are known to sustain a much higher speed (> 30km/h), we will discard all repeated encounters each day. We will keep only the first sighting made each day by each company.
Below, we see no overlap between the sighting locations from the two whale-watch companies. Could this suggest that an assumption of independent encounters between the two companies may be reasonable?
Sightings_DRWW_sp$PLATFORM <- as.factor(Sightings_DRWW_sp$PLATFORM)
ggplot() + gg(Domain) + gg(Sightings_DRWW_sp, colour=Sightings_DRWW_sp$PLATFORM_CODE)
Perhaps. Further scrutiny reveals that the distance between the average locations of the encounters made by the two companies is around 80km:
# What is the average distance between the encounters from the two companies?
gDistance(gCentroid(Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='BRIER ISLAND WHALEWATCH',]),gCentroid(Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='QUODDY LINK',]))
## [1] 78.27287
Given the speed at which fin whales travel, in a real analysis we should investigate this further; a violation of this assumption could result in overestimating the densities of whales in the whale-watching areas.
However, for simplicity we will assume that the encounters between the two companies are independent. We subset the data to keep only the initial encounters from each company:
Sightings_Brier_nodup <- Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='BRIER ISLAND WHALEWATCH',]
Sightings_Brier_nodup <- Sightings_Brier_nodup[!duplicated(Sightings_Brier_nodup$WS_DATE),]
Sightings_Quoddy_nodup <- Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='QUODDY LINK',]
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[!duplicated(Sightings_Quoddy_nodup$WS_DATE),]
We now have a total of 229 whale watch encounters after the subsetting procedure, down from an original count of 567 encounters, see:
# Original dataset
dim(Sightings_DRWW_sp)[1]
## [1] 567
# Initial encounters from Brier
dim(Sightings_Brier_nodup)[1]
## [1] 85
# Initial encounters from Quoddy
dim(Sightings_Quoddy_nodup)[1]
## [1] 144
Strictly speaking, for the desired point process models to be reasonable, it is required that the initial daily encounter locations from the whale-watch vessels are independent realizations from \(\lambda_{true}\), conditioned on \(\lambda_{eff}\). We will assume that the typical journeys/routes of the whale-watch vessels (and hence \(\lambda_{eff}\)) remains constant across months and years under study.
As we discussed earlier, we are unable to model temporal changes in the species density due to the spatially non-overlapping survey efforts across the months and years. Thus any changes in the species intensity from June - August, or across the years 2007-2009 will be missed. Thus, we are forced to assume that the summer species density remains constant between 2007-2009. It is the summer species density that is our target for inference.
How reasonable is our assumption that the species density remains constant across the years and across the months? Again, we can visually assess the suitability of the assumption with a plot. We use the whale watch data, with its effort assumed constant through time, to investigate if any changes occur in time.
So, we are assuming that the whale watch effort (\(\lambda_{eff}(\textbf{s})\)) is constant through time. We now want to empirically test our hypothesis that the true whale intensity (\(\lambda_{true}(\textbf{s})\)) is also constant across time. If our hypothesis is true, then our (initial daily) whale-watch sighting locations should appear as approximately independent, with an intensity (\(\lambda_{obs}(\textbf{s})\)) that is also constant in time. Why? Because \(\lambda_{obs}(\textbf{s}) = \lambda_{true}(\textbf{s})\lambda_{eff}(\textbf{s}).\)
A consequence of our hypothesis is that we should see no association between the spatial distances between the whale-watch encounters and how far apart in time they were made. Thus, we can empirically test for this; evidence of a distance-time trend would not be consistent with our assumption that the fin whales space use does not change across the months or years. Furthermore, evidence of a trend around day 1 could indicate that subsetting the data by 1 day has not removed all the autocorrelation due to animal movement.
To plot distances against time intervals, we first compute the distances in time and the euclidean distances in space between each sighting.
difftime_Brier<- dist(as.numeric(ymd(Sightings_Brier_nodup$WS_DATE)))
diffspace_Brier <- dist(Sightings_Brier_nodup@coords)
difftime_Brier_fac <- as.factor(difftime_Brier)
difftime_Quoddy <- dist(as.numeric(ymd(Sightings_Quoddy_nodup$WS_DATE)))
diffspace_Quoddy <- dist(Sightings_Quoddy_nodup@coords)
difftime_Quoddy_fac <- as.factor(difftime_Quoddy)
We will create multiple plots with days on the x-axis and km on the y-axis. The first plots are restricted to only consider temporal differences of less than 85 days. We are plotting a sequence of boxplots to help detect patterns.
ggplot(data=data.frame(difftime =
difftime_Brier_fac[as.numeric(difftime_Brier)<85],
diffspace = as.numeric(diffspace_Brier)[as.numeric(difftime_Brier)<85]),
aes(y=diffspace, group=difftime)) +
geom_boxplot() +
xlab('difference in time (days)') +
ylab('difference in space (km)')
The second plot is a different version that uses a loess smoother curve to try to help uncover the trend.
ggplot(data=data.frame(difftime = as.numeric(difftime_Brier)[as.numeric(difftime_Brier)<85],
diffspace = as.numeric(diffspace_Brier)[as.numeric(difftime_Brier)<85]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess') +
xlab('difference in time (days)') +
ylab('difference in space (km)')
These two plots indicate some moderate autocorrelation between the encounters, with encounters made ~1 month apart typically being around 50% further away than those made 24 hours apart. Notice that the trend plateaus after ~14 days. Perhaps this trend could be due to autocorrelation remaining in the data due to the persistence in animal movement? This may suggest that the data need to be further subsetted to achieve approximate independence. Looking at the plot, perhaps we should discard all whale watch encounters made fewer than 10-14 days apart.
Conversely, the flat-line trend between day ~14 and day ~75 could be evidence in favour of an approximately constant fin whale space use across the summer months. If the space use did indeed change dramatically across the months (e.g. due to migration), then we may expect to see a monotonically increasing trend over time.
The third plot investigates the space-time relationship across years.
ggplot(data=data.frame(difftime = as.numeric(difftime_Brier)[],
diffspace = as.numeric(diffspace_Brier)[]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')+
xlab('difference in time (days)') +
ylab('difference in space (km)')
This third plot shows no evidence that the average spatial separation between whale-watch encounters differs year-to-year. This supports our assumption that the spatial density can be assumed to be constant across the years, albeit within the (very) small spatial region visited by the whale-watch boats departing from Brier Island.
In summary, the plots of the Brier whale watch data have cast doubts upon our assumption that subsetting the encounters to the initial daily encounters is sufficient to establish approximately independent encounters. Conversely, the plots have supported the hypothesis that the whale intensity (within the regions visited by the whale watch vessels) is constant across the months and years. Note that in a real analysis, we would also want to verify these assumptions throughout our study region \(\Omega\) using the other opportunistic data sources (e.g. commercial vessels).
We now repeat the plots, but for the whalewatch boats that depart from Quoddy, starting with the loess smoother curve on the first 84 days.
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(difftime_Quoddy)<85],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')
In this plot, we see a large separation in the y-axis. This is caused by a single encounter that was made (far) East of the port. This is seen in the map created above, as the right-most cluster of blue points. Note that both encounters seen in the cluster were made on the same day (we only kept the initial encounter). The plot above indicates that these encounters were made over 90km from port (potentially data entry mistake), and for simplicity we will discard this encounter.
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[which(gDistance(Sightings_Quoddy_nodup, gCentroid(Sightings_Quoddy_nodup),byid=T)<50),]
ggplot(data=data.frame(difftime =
difftime_Quoddy_fac[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50]),
aes(y=diffspace, group=difftime)) +
geom_boxplot() +
xlab('difference in time (days)') +
ylab('difference in space (km)')
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')
These two plots also show evidence of potential autocorrelation existing up to ~14 days. Again, this could be due to persistence of animal movement, for example, due to a few whales remaining in the same area for a few days.
As we see below, there is no evidence for substantial differences in the whale intensity across the years.
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(diffspace_Quoddy)<50]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')+
xlab('difference in time (days)') +
ylab('difference in space (km)')
In summary, the Quoddy plots show large agreement with the Brier plots. We see some evidence that residual autocorrelations remain in the encounters data. This is despite subsetting the data to the initial daily encounters. The result of this may be overdispersion or the presence of additional clustering that cannot be described by \(\lambda_{true}(\textbf{s})\) alone. This could bias our future inference and predictions of \(\lambda_{true}(\textbf{s})\).
Despite this, we push forward and continue to use our initial daily encounters for these workshops. In a real data analysis, additional information on the individual ID’s could be used (perhaps to estimate a unique group-level UD) or the data could be further subsetted. In the latter case, a sensitivity analysis could be performed with the statistical models re-fit and compared using subsets of the data following increasingly-strict subsetting procedures.
It is common for multiple competing models to be fit. Deciphering which model fits the data ‘best’ using information criterion alone (AIC, DIC, etc.,) can be problematic. Performing model comparisons using the same dataset/s that were used to fit the models and perform exploratory analysis can be problematic (overfitting can occur). An alternative approach is to test both the predictive accuracy and the estimated uncertainty on an unseen dataset.
We now subset the data into a training and test set. We choose the training set to contain all the encounters from 2007-2009. For the test data, we choose all the encounters from 2011. See Additional tips for an explanation of why the potential repeated sighting in 2011 between survey and whale-watch data means that the whale-watching data cannot be added to the training dataset. Note that we perform all remaining exploratory analysis on the training data.
Below, we split the data into training and test data.
Sightings_Brier_nodup_test <- Sightings_Brier_nodup[which(Sightings_Brier_nodup$YEAR==2011),]
Sightings_Brier_nodup <- Sightings_Brier_nodup[which(Sightings_Brier_nodup$YEAR!=2011),]
Sightings_Quoddy_nodup_test <- Sightings_Quoddy_nodup[which(Sightings_Quoddy_nodup$YEAR==2011),]
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[which(Sightings_Quoddy_nodup$YEAR!=2011),]
Sightings_survey_test <- Sightings_survey[which(Sightings_survey$YEAR==2011),]
Sightings_survey <- Sightings_survey[which(Sightings_survey$YEAR!=2011),]
Effort_survey_test <- Effort_survey[which(Effort_survey$YEAR=='2011'),]
Effort_survey <- Effort_survey[which(Effort_survey$YEAR!='2011'),]
# How many survey sightings by year?
table(Sightings_survey$YEAR)
##
## 2007 2008 2009
## 45 17 1
Next, we evaluate a crude estimate of the total amount of survey effort spent each year, as measured by the total trackline length.
# How much survey effort (in trackline length) is there by year
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum)
## Effort_survey$YEAR: 2007
## [1] 11643.15
## ------------------------------------------------------------
## Effort_survey$YEAR: 2008
## [1] 571.9083
## ------------------------------------------------------------
## Effort_survey$YEAR: 2009
## [1] 819.9337
Using this effort estimate, we then crudely estimate a ‘catch per unit effort’ (CPUE) measure by year.
# Crude estimate of relative `CPUE'
(table(Sightings_survey$YEAR) /
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum))/
min(table(Sightings_survey$YEAR) /
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum))
##
## 2007 2008 2009
## 3.168989 24.372565 1.000000
The ‘CPUE’ values are highly variable! The wild differences could be due to the small efforts in years 2008 and 2009 relative to 2007.
# How many WW sightings by year
table(Sightings_Quoddy_nodup$YEAR)
##
## 2007 2008 2009
## 39 24 44
table(Sightings_Brier_nodup$YEAR)
##
## 2007 2008 2009
## 24 19 19
Alternatively, the differences in CPUE could be due to differences in survey protocols that took place each year. Or, the differences could actually reflect real differences in whale density across the three distinct regions visited each year! Without taking a model-based approach and controlling for differences in spatial effort, it is unclear how to combine the encounters from these three surveys.
As we will see in the next tutorial, by taking a model-based approach to inference, we will be able to combine the data from all three surveys to estimate the whale density across space. Unfortunately, due to the lack of spatial overlap between the surveys’ efforts (tracklines), we are unable to account for any differences between the survey protocols within a model without strong prior information available. For example, an intercept added to the model for each survey would be confounded by the spatial field. From now on, we assume that an hour of effort from each survey can be considered equivalent.
To explore which detection functions (\(p(d)\)) could be used with the survey data, we will first plot the histogram of measured distances from the plane (i.e. the observed distances of encounters from the trackline).
# What is the maximum detection distance?
ggplot(Sightings_survey@data, aes(x=DISTANCE)) + geom_histogram() + xlab('Distance (m)')
The histogram of the perpendicular distances from the aircraft shows an apparent maximum detection probability at a value of distance above 0 metres! Low detection probability close to 0 metres is common for aerial surveys since it is harder to see below the plane. Common approaches include truncating the data, and fitting a two-parameter distance sampling function that allows for a non-monotonic relationship to exist.
Given the limited size of the survey data and the scope of this workshop, we choose not to pursue either option and we instead continue with estimating a single-parameter detection probability function. This crude approach assumes the detection probability is a monotonically decreasing function of distance.
In addition, it is often advised in distance sampling applications to threshold our upper detection distances at a ‘sensible’ value. For the histogram above, it looks like 2km could be a reasonable threshold value to choose. Let’s set all values above 2km equal to 2km and scale the distances onto the km units scale. This rescaling to units of km is crucial for numerical stability. We define a variable ‘distance’ that contains the rescaled distances. Note that a few distances are missing. We impute these with the mean distance.
# Setting the distances >2000 to 2000
Sightings_survey@data$DISTANCE <-
ifelse(Sightings_survey$DISTANCE>2000 & !is.na(Sightings_survey$DISTANCE),
2000,Sightings_survey$DISTANCE)
# Needs renaming to match the formula argument
Sightings_survey$distance <- Sightings_survey$DISTANCE
# Impute missing distances with the mean
Sightings_survey$distance[is.na(Sightings_survey$distance)] <- mean(Sightings_survey$distance,na.rm=T)
# Remove the old DISTANCE column
Sightings_survey <- Sightings_survey[,-c(8)]
# Rescaling to km
Sightings_survey$distance <- Sightings_survey$distance / 1000
# Plot the modified distances
ggplot(Sightings_survey@data, aes(x=distance)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The single-parameter detection probability function we will use is the half-normal detection function:
\[p(d) = exp\left(\frac{-d^2}{2\sigma^2}\right)\]
In lecture 1 we showed how we could attempt to model effort with a set of informative covariates. This approach is most useful for opportunistic datasets which lack quality information on the observers’ locations. Throughout our tutorials, we will use distance from port as a covariate of effort for the the whale-watch data.
Here, we investigate a potential functional form for the distance from port covariate. To do this, we plot a histogram of the distances from port at each of the whale watch sighting locations.
# Plot the encounters with distance from the port
hist(over(Sightings_Brier_nodup,Dist_Brier)$Dist_Brier, main='Histogram of the distance from port of the Brier encounters', xlab = 'Distance from port')
hist(over(Sightings_Quoddy_nodup,Dist_Quoddy)$Dist_Quoddy,breaks=20, main='Histogram of the distance from port of the Quoddy encounters', xlab = 'Distance from port')
For both ports, we detect a decreasing frequency of encounters made as the distance from port increases. Based on the histograms, we choose to model the functional form as a half-normal function. More complicated functional forms (e.g. weibull or hazard functions) could also be used.
More specifically, let the whale-watch effort intensity from the \(i^{th}\) port be denoted \(\lambda_{eff, i}\) and let the location of the \(i^{th}\) port be denoted \(s_i^*\). Finally, let the scalar parameter \(\lambda_i\) be unique for each port. Then we assume the following functional form for the whale-watch effort intensity from the \(i^{th}\) port: \[\lambda_{eff,i}(s) = \lambda_i exp\left(\frac{-||s-s_i^*||^2}{2\sigma_i^2} \right).\]
On the log scale, this covariate can be estimated as a linear effect on the squared distance from port. Thus, we create SpatialPixelsDataFrame objects which store the squared distances from port.
Dist_Brier_sq <- as(Dist_Brier,'SpatialPixels')
Dist_Brier_sq$Dist_Brier_sq <- Dist_Brier$Dist_Brier^2
Dist_Quoddy_sq <- as(Dist_Quoddy,'SpatialPixels')
Dist_Quoddy_sq$Dist_Quoddy_sq <- Dist_Quoddy$Dist_Quoddy^2
In this section, we will investigate trends between the observed intensity and the covariate slope using only the training data. We will also see why taking a log transform of Slope was a sensible thing to do.
A quick first step is to evaluate whether the empirical density of a covariate at the encounter locations (representing habitat used by the whales) differs from the empirical density across the domain (representing available habitat values). Such a difference may indicate that the whales are associated with a subset of the available habitat. Of course, any differences may simply reflect differences in the types of habitat visited by observers!
To extract the covariate values at each location, we will use the function over() from the sp package.
# Create a SpatialPointsDataFrame containing all the encounter locations
All_obs <- rbind(Sightings_Brier_nodup[,'YEAR'],
Sightings_Quoddy_nodup[,'YEAR'],
Sightings_survey[,'YEAR'])
# Extract the value of slope at the observation locations
Slope_obs <- over(All_obs, Slope)
First, we plot the empirical densities of Slope at the encounters locations and throughout the domain (study area).
ggplot(data=data.frame(Slope_obs = Slope_obs$FIWH_MAR_Slope),
aes(x=Slope_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Slope_dom = as.numeric(Slope$FIWH_MAR_Slope)),
aes(x=Slope_dom, colour='Domain')) + xlab('Slope') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Domain' = 'red'))
We see a very skewed distribution of Slope with a huge range of values. Leaving the covariate in this form risks wild predictions of intensity being made! A log transform may reduce the skew of the distribution and help to stabilise predictions. Let’s see this:
# Extract the value of slope at the observation locations
log_Slope_obs <- over(All_obs, log_Slope)
ggplot(data=data.frame(Slope_obs = log_Slope_obs$log_Slope),
aes(x=Slope_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Slope_dom = as.numeric(log_Slope$log_Slope)),
aes(x=Slope_dom, colour="Domain")) +
xlab('log Slope') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Domain' = 'red'))
These plots suggest that encounter locations are associated with regions with higher slope compared with what is available in the study area. However, these results are only investigating the associations between \(\lambda_{obs}\) and the covariates and not the desired associations between \(\lambda_{true}\) and the covariates. The associations may simply reflect the types of regions preferentially visited by observers! Note that the last plot shows a lack of overlap in the highest log_Slope values (around 2). Consequently, detected associations between log_Slope and animal density may become extrapolated beyond the range of observed values! We will address this in the later sessions.
In summary, from this exploratory analysis, we have found evidence that the log slope is positively associated with the observed intensity. However, we currently do not know if this a true species-environmental relationship, or if the association is instead caused by observers spending more time in regions with higher slope. In the next two workshops, we will develop a model-based approach using the inlabru package that can control for the effects to better investigate species-environmental trends. Finally, Exercise 3 below offers an alternative approach for graphically assessing species-environment trends that may better control for effort.
An improved approach for graphically assessing possible univariate species-environment trends is possible. Repeat the above empirical density plots, but evaluate the empirical densities of log slope at the survey encounter locations and the survey effort tracklines (instead of the domain). What do you notice? Is this trend consistent with the previous empirical density plot? Which one do you trust more? Do you have any concerns with this approach?
Hint: remember to use the over() function to extract the values of log slope at a Spatial object (including spatiallines).
# Extract the value of slope at the survey encounter locations
log_Slope_enc <- over(Sightings_survey, log_Slope)
# Extract the value of slope at the survey trackline locations
log_Slope_effort <- over(Effort_survey, log_Slope)
ggplot(data=data.frame(Slope_obs = log_Slope_enc$log_Slope),
aes(x=Slope_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Slope_eff = as.numeric(log_Slope_effort$log_Slope)),
aes(x=Slope_eff, colour="Tracklines")) +
xlab('log Slope') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Tracklines' = 'red'))
If you got stuck on any of the exercises, then please feel free to try them again. Here are links to the problems:
For publication, there can be issues regarding copyright of Google Maps. Using OpenStreetMap can help. To guarantee this simply add the following argument: source='osm', force=TRUE to gmap(). Double check the console that the maps are indeed being downloaded from stamen or osm. For brevity we have suppressed the messages.
gmap(Sightings_survey, source='osm', force=TRUE) +
gg(Domain) +
gg(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
Note for this to work it needs to be in the original project (lat/lon), not UTM.
The multiplot() function is a very flexible function that enables publication-quality figures to be made with relative ease.
To change the order of the plots you can change the argument byrow=TRUE to byrow=FALSE.
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry'),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(1:4, nrow=2, ncol=2, byrow = FALSE))
We can also change the size of the figures by changing the matrix in the layout.
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry'),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))
As you can see plots of different size stretches the maps, to keep the set projection we can use coord_fixed(ratio = 1).
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))
We can define the colour palette easily.
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"PuBuGn")),
limits = range(...))
}
Look at ?RColorBrewer::brewer.pal to see what other colour palettes are available.
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry') +
colsc(Slope@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
colsc(Dist_Brier@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
colsc(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
Have a go at creating your own colour palette function. Investigate the effects of changing both arguments to brewer.pal.
colsc2 <- function(...){
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(7,"Spectral")),
limits = range(...))
}
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry') +
colsc2(Slope@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
colsc2(Dist_Brier@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
colsc2(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
Could we not also put 2011’s whale-watch data into the training set? Below is some code showing that we cannot. A whale is sighted by at least one whale watch company on every day in 2011 that a survey detects a whale. The plot below shows that these encounters could have been of the same animal. Even if they weren’t, the tracklines in 2011 still visited areas that the whale watch vessels were present. Since our training and test datasets are required to be independent of each other, we must therefore also remove the 2011 whale watch encounters from the training data.
# what dates were survey encounters made on in 2011?
unique(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011])
## [1] "2011-06-21" "2011-08-12" "2011-08-13"
# Are encounters by the WW vessels made on these dates?
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][1],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][2],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][3],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
# Could they be of the same animal?
ggplot() + gg(Domain) +
gg(Sightings_survey_test[Sightings_survey_test$YEAR==2011,],colour='green') +
gg(Effort_survey_test[Effort_survey_test$YEAR==2011,],colour='yellow')
The code for hiding the Rmd code chunks came from Martin Schmelzer, found here
References/Sources for data sets:
DFO Maritimes Region Whale Sightings Database
MacDonald, D., Emery, P., Themelis, D., Smedbol, R.K., Harris, L.E., and McCurdy, Q. 2017. Marine mammal and pelagic animal sightings (Whalesightings) database: a users guide. Can. Tech. Rep. Fish. Aquat. Sci. 3244: v + 44 p.
NOAA NARW Surveys
Timothy V.N. Cole. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA
2007 TNASS DFO Aerial Survey
Lawson. J.W., and Gosselin, J.-F. 2009. Distribution and preliminary abundance estimates for cetaceans seen during Canada’s marine megafauna survey - A component of the 2007 TNASS. DFO Can. Sci. Advis. Sec. Res. Doc. 2009/031. vi + 28 p.
NOAA Cetacean Surveys
Timothy V.N. Cole and D. Palka. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA